library(kableExtra)
library(tidyverse)
library(dplyr)
library(ggpubr)
library(lattice)
library(lme4)
[LRJ: Once we finalize the case study, we should come back and revisit the intro/motivation section]
There are over 1400 public schools in Maryland and there are also outstanding students in each school. The federal Every Student Succeeds Act (ESSA) prompted states to develop long term plans to improve schools through accountability and innovation, which sets Maryland’s schools on the path to continuous improvement. Maryland Report Card website aims to share the most current information available to help stakeholders understand and measure student achievement in all 24 local school systems. Here is a message from the State Superintendent of Schools that enables you to know more about how it functions.
Our analysis is inspired by this plan - figure out how well schools were performing and how different factors influenced their performance. Once we indentify schools that need improvements and influential factors, both Maryland’s government and local organizations can prompt actions and provide necessary support, in a way that is understanable and reliable.
The libraries used in this study are listed in the following table, along with their purpose in this particular case study:
| Library | Purpose |
|---|---|
kableExtra |
Helps with building common complex tables and manipulating table styles; creates nice-looking HTML tables |
tidyverse |
A coherent system of packages for data manipulation, exploration and visualization |
dplyr |
Helps you solve the most common data manipulation challenges |
In order to run this code please ensure you have these packages installed.
The learning objectives for this case study include:
The data files were downloaded from the Maryland Report Card website. Data files can be found by clicking Data Download at the bottom of the web page.
At this link, you can find data for all 1400+ schools in Maryland from 2003 to 2018. For this case study, we will focus on data on high schools from 2017. For this analysis, we will extract data from 5 of the files for that year:
Cohort_Grad_Rate_2017.csv: graduation rates for each schoolPARCC_2017.csv: school performance on the Partnership for Assessment of Readiness for College and Careers (PARCC) learning assessmentsSpecial_Services_2017.csv: school participation in the Free and Reduced Meals (FARMs) program and school participation in Special Education programsStudent_Mobility_2017.csv: percentage of students that either enter or withdraw from the schoolWealth_Expenditures_Data_2017.csv: county level wealth and expenditure data for educationAfter downloading the 5 data files, we can write a script to import all the files in the same folder into R relatively easily. In this case, we’ve saved all files in a folder named data2017:
Knowing which environment you are in is necessary since it enables R to find these files quickly and accurately. Since our data files are in the folder data2017, we will need to specify this in the path to each file.
Instead of importing each file with an individual line of code, we can use the function lapply() to import them all simultaneously. To do this, first we make a vector of all of the names for the files we want to import. The list.files() function will make a vector of all files in a folder that match a given pattern. Here we want all .csv files, so we can use pattern="*.csv" to specify all files that end with a .csv. The * character means any set of characters can be in that location in the file name.
files = list.files(path='./data2017',pattern="*.csv")
files
## [1] "Cohort_Grad_Rate_2017.csv" "PARCC_2017.csv"
## [3] "Special_Services_2017.csv" "Student_Mobility_2017.csv"
## [5] "Wealth_Expenditures_Data_2017.csv"
Next we create a vector of paths to these files by pasting the folder name, data2017, to the path for each file.
filePaths = paste0("./data2017/", files)
filePaths
## [1] "./data2017/Cohort_Grad_Rate_2017.csv"
## [2] "./data2017/PARCC_2017.csv"
## [3] "./data2017/Special_Services_2017.csv"
## [4] "./data2017/Student_Mobility_2017.csv"
## [5] "./data2017/Wealth_Expenditures_Data_2017.csv"
Finally, we can read in each data file. The lapply() function will apply the same function repeatedly to each item in a vector and store the results in a list. The function we want to apply repeatedly is the read_csv() function; by giving the vector of file paths and this function to lapply(), we can ready in all 5 data files at once.
[LRJ: I think we want to use the read_csv() function here instead of read.csv() to stay within the tidyverse. However, when I make this change it changes some of the names of the variables, since it doesn’t replace spaces with dots. I do want to make this change, but then will need you to update the later code so that the names will work and the screenshots are accurate.]
data <- lapply(filePaths, read.csv, header=TRUE)
#data <- lapply(filePaths, read_csv)
Now we have a data object that consists of a list of data sets. We can access an individual data set with the [[]] operator.
#data
#data[[1]]
head(data[[1]])
## Class.of.Year LEA LEA.Name School.Number
## 1 2017 1 Allegany 405
## 2 2017 1 Allegany 405
## 3 2017 1 Allegany 601
## 4 2017 1 Allegany 601
## 5 2017 1 Allegany 606
## 6 2017 1 Allegany 606
## School.Name Cohort Grad.Rate
## 1 Fort Hill High 4-year adjusted cohort 82.86
## 2 Fort Hill High 3-year adjusted cohort <= 5.00
## 3 Center for Career & Technical Education 4-year adjusted cohort >= 95.00
## 4 Center for Career & Technical Education 3-year adjusted cohort <= 5.00
## 5 Allegany High 4-year adjusted cohort 90.78
## 6 Allegany High 3-year adjusted cohort <= 5.00
## Diplomas.Earned Adjusted.Cohort.Count Create.Date
## 1 145 175 20180117
## 2 * 163 20180117
## 3 * 132 20180117
## 4 * 137 20180117
## 5 128 141 20180117
## 6 * 133 20180117
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
[LRJ: is the quote from a book? then we need to cite the book and not just the author]
A large part of the case study involves data wrangling to get these five data files into a single data set that we can analyze. To do this, we will alternate between viewing the data and performing data wrangling steps. In this case-study, we will provide screenshots of the data; you can view the data in the same way by using the View() function on the data frame we are working with.
The first data file (Cohort_Grad_Rate_2017.csv) records the percentage of students in each school who received a Maryland high school diploma during 2017. Let’s look at the data frame that comes from this file:
View(data[[1]])
This data frame has a total of 579 rows with 10 variables. Looking at this data frame in R, we see there are several issues we will need to address, such as possible missing values designated by *. We will deal with these issues in later steps.
Our first step is to subset to only school-level data. We notice that if School.Number equals ‘A’, then this row corresponds to county level data of all of the high-schools in the county. We also notice that there are two different graduation rates listed for different cohorts of students:
Since high school students typically graduate after four years, it makes more sense to use the 4-year adjusted cohort as our graduation rate variable. We also see this by looking at the values for these two rates – the 3-year rate is generally quite low while the 4-year rate is generally in the 80-100% range we would expect.
Function filter() and select() in package dplyr will help us to finish the first wrangling step.
df_grad <- data[[1]] %>%
filter( School.Number != 'A',
Cohort == '4-year adjusted cohort') %>%
select(LEA.Name, School.Number,
School.Name, Grad.Rate)
colnames(df_grad) <- c('county', 'sch.num',
'sch', 'grad.rate')
See what we get! Only school level information and useful variables are kept, leaving us a neater dataset. Next, missing values should be fixed.
There are a lot of missing values in the dataset, which would influence further analysis if we leave them there. From the first plot, missing values exist in different forms:
'>= 95.00','<= 5.00' : The graduation rate is higher than 95% or lower than 5%.'*' : This schools’ graduation rate is simply not present in the data.Use function summary() to produce the descriptive statistics of these missing values.
summary(df_grad[2:4])
## sch.num sch
## 301 : 4 Chesapeake High : 2
## 102 : 3 Frederick Douglass High : 2
## 207 : 3 Northwestern High : 2
## 405 : 3 Aberdeen High : 1
## 705 : 3 Academy for College and Career Exploration: 1
## 108 : 2 Academy of Health Sciences at PGCC : 1
## (Other):246 (Other) :255
## grad.rate
## >= 95.00: 57
## * : 27
## <= 5.00 : 9
## 84.02 : 2
## 84.4 : 2
## 86.21 : 2
## (Other) :165
Pay attention to the first kind of missing value. It is impossible to calculate the exact graduate rate because of the lack of information, so we decide to use gsub() function to replace incomplete values. gsub() function replaces all matches of a string. Elements of string vectors which are not substituted will be returned unchanged.
Without knowledge of exact value, 2.5 and 97.5 are used to replace ‘<= 5.00’ and ‘>= 95.00’ respectively. By the way, graduation rate lower than 5% sounds abnormal - let’s look at these school before we replace them.
head(df_grad[df_grad$grad.rate == '<= 5.00', ])
## county sch.num sch
## 23 Baltimore County 54 Extended Day Learning Program
## 24 Baltimore County 58 Home Assignments-Secondary
## 25 Baltimore County 69 Catonsville Center for Alternative Studies
## 26 Baltimore County 72 Rosedale Center
## 28 Baltimore County 77 BCDC Educational Center
## 157 Montgomery 916 Rock Terrace School
## grad.rate
## 23 <= 5.00
## 24 <= 5.00
## 25 <= 5.00
## 26 <= 5.00
## 28 <= 5.00
## 157 <= 5.00
Search Extended Day Learning Program and Home Assignments-Secondary, they provides special education and special needs school programs, which may be designed for adults - explaining the low graduation rate.
df_grad$grad.rate<- gsub('<= 5.00','2.5', df_grad$grad.rate)
df_grad$grad.rate <- gsub('>= 95.00','97.5', df_grad$grad.rate)
The second type of missing value (’*’) is confusing, let’s extract them out and have a look:
head(df_grad[df_grad$grad.rate == '*', ])
## county sch.num sch grad.rate
## 6 Anne Arundel 1274 Marley Glen School *
## 16 Anne Arundel 3414 Ruth Parker Eason School *
## 21 Anne Arundel 4304 Central Special School *
## 27 Baltimore County 75 Crossroads Center *
## 29 Baltimore County 111 Maiden Choice School *
## 41 Baltimore County 922 Ridge Ruxton *
If we search Marley Glen School, we may found it mainly provides program for students with disabilities. Also, Ruth Parker Eason School provides a special education program for students with moderate to severe disabilities. So we guess schools with ’*‘graduation rate mainly provide education for students with moderate to severe disabilities. We can either delete it or replace them as ’NA’. Considering we need to merge multiple files in later steps, such kind of information may be dropped out automatically. Ths most appropriate method for us is replacing them as ‘NA’ and deal with them all together.
df_grad$grad.rate <- na_if(df_grad$grad.rate, '*')
df_grad$grad.rate <- as.numeric(df_grad$grad.rate)
Our last step is to extract information from several files and merge them together, requiring a key that specifies each observation uniquely. Possible choices in present dataset can be sch.num or sch.name but we need to check whether each of them appear once in the dataset. If not, they are not suitable choice because of the lack of uniqueness.
summary(df_grad[,c('sch.num', 'sch')])
## sch.num sch
## 301 : 4 Chesapeake High : 2
## 102 : 3 Frederick Douglass High : 2
## 207 : 3 Northwestern High : 2
## 405 : 3 Aberdeen High : 1
## 705 : 3 Academy for College and Career Exploration: 1
## 108 : 2 Academy of Health Sciences at PGCC : 1
## (Other):246 (Other) :255
From the summary, whether sch.num or sch.name, the frequency of some values is larger than 1. Therefore it is definitely important to define a unique element by ourselves.
df_grad <- df_grad %>%
within( id <- paste(county, sch, sep = '-'))
The combination of county and sch generates a new variable - id. To check its uniqueness, we use function table(), which builds a contingency table of the counts at each combination of factor levels. The as.data.frame() function converts the array-based representation of a contingency table to a data frame containing two variable: each factor Var1and its frequency Freq.
tab <- as.data.frame(table(df_grad$id))
tab[tab$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)
Up to now, a unique key element id is generated and we finally create the ideal dataset - a tidy dataset.
Partnership for Assessment of Readiness for College and Careers (PARCC) reflects schools’ academic achievements through assessing the performance of students on state standardized tests, such as English and Math. It not only provides information about students mastery of state standards, but also offers teachers and parents with timely information to inform instruction and how to provide support. From this file, we want to extract the percentage of students scoring ‘high performance’, which works as the target variable is data analysis part.
colnames(data[[2]])
## [1] "Academic.Year"
## [2] "LEA.Number"
## [3] "LEA.Name"
## [4] "School.Number"
## [5] "School.Name"
## [6] "Assessment"
## [7] "Tested.Count"
## [8] "Level.1..Did.not.yet.meet.expectations...Count"
## [9] "Level.1..Did.not.yet.meet.expectations...Percent"
## [10] "Level.2..Partially.met.expectations...Count"
## [11] "Level.2..Partially.met.expectations...Percent"
## [12] "Level.3..Approached.expectations...Count"
## [13] "Level.3..Approached.expectations...Percent"
## [14] "Level.4..Met.expectations...Count"
## [15] "Level.4..Met.expectations...Percent"
## [16] "Level.5..Exceeded.expectations...Count"
## [17] "Level.5..Exceeded.expectations...Percent"
## [18] "Create.Date"
PARCC is a large file with 8989 observations and 18 variables. We notice there are five levels of performance indicators, ranging from Level 1 to Level 5. First, we use percentage information rather than count information. Second, we will create a new variable, pro, that indicates the percentage of students performing at the ‘met expectations’ and ‘exceeded expectations’ levels.
Similiar to graduation rate file, only school level information and necessary variables (LEA.Name, School.Number, School.Name, Assessment and five level indicators) will be kept and renamed.
df_parcc <- data[[2]] %>%
filter( School.Number != 'A' ) %>%
select(3,4,5,6,9,11,13,15,17)
colnames(df_parcc) <- c('county','sch.num',
'sch', 'subject','L1',
'L2','L3','L4','L5')
Variable subject reflects the kind of subject test. English/Language Arts Grade 10, Algebra 1 and 2 are filtered out by function filter() in package dplyr. You are encouraged to keep other assessments that you are interested in.
df_parcc <- df_parcc %>%
filter(subject %in% c('English/Language Arts Grade 10','Algebra 1','Algebra 2'))
Missing value is the toughest problem in this wrangling part. Let’s see its summary first.
summary(df_parcc[, 5:9])
## L1 L2 L3 L4 L5
## <= 5.0 :273 <= 5.0 :176 <= 5.0 :127 <= 5.0 :151 <= 5.0 :597
## 17.6 : 7 28.6 : 9 33.3 : 10 33.3 : 7 6.8 : 7
## 5.5 : 7 15.2 : 8 22.7 : 8 75 : 7 19.8 : 5
## 15.8 : 6 16.7 : 7 25 : 8 11.4 : 5 5.6 : 5
## 5.6 : 6 17.9 : 7 14.3 : 7 17.6 : 5 5.1 : 4
## 6.7 : 6 18.2 : 7 19.7 : 7 34.1 : 5 8.6 : 4
## (Other):569 (Other):660 (Other):707 (Other):694 (Other):252
There is only one kind of missing value: <=5.0 in these 5 levels indicators. We will replace it as NA, then transform them to numerical data.
df_parcc[,5:9] <-
lapply(df_parcc[,5:9], function(x) as.numeric(as.character(x)))
We only care about the proportion of L4 and L5 but NAs are distributed differently. To simply wrangling steps, we reduce these five levels into two new levels : ‘Level weak’ (L1, L2 and L3) and ‘Level excellent’ (L4 and L5). Then deal with NAs in three ways. If values of ‘Level excellent’ are complete, then pro equals to sum of L4 and L5. Such as Algebra 1 assessment of ‘Washington Middle’, its pro should be \(85.5 + 9.7 = 95.2\); If values of ‘Level weak’ all exist, pro equals to 100 percent minus the sum of L1, L2 and L3. For example, the pro of Algebra 1 for ‘Fort Hill High’ equals to \(100-23.9-44.6-21.7 = 9.8\);
df_parcc$pro <- NA
indx1 <- !is.na(df_parcc$L4) & !is.na(df_parcc$L5)
df_parcc[indx1, 'pro'] <- rowSums(df_parcc[indx1,8:9])
sum(is.na(df_parcc$pro))
## [1] 597
indx2 <- !is.na(df_parcc$L1) & !is.na(df_parcc$L2) & !is.na(df_parcc$L3)
df_parcc[indx2, 'pro'] <- 100-rowSums(df_parcc[indx2,5:7])
sum(is.na(df_parcc$pro))
## [1] 203
Function is.na() indicates which elements are missing and returns a boolean index of the same shape as the original data frame. df_parcc[indx1,] and df_parcc[indx2,] specify observations have complete information of ‘Level excellent’ and ‘Level weak’ respectively. sum(is.na(df_parcc$pro)) shows the counts of NA in variable pro. Undoubtedly, our methods reduce NAs greatly after we apply corresponding methods mentioned above.
If missing values exist both in these two levels, we will first use 100 percent minus existing values. Then divide the difference to the number of missing value and replace NA with this quotient. For instance - ‘Brooklyn Park Middle’, there are three NAs, so \(NA = (100-20.8-72.7)/3=2.2\), and pro equals to \(72.7+2.2=74.9\). The reason we don’t replace these NAs as 2.5 directly is that there is a restriction we don’t want to obey - the sum of 5 levels equals to 100 percent. It is more reasonable to assume missing values are uniformly distributed than replace them with a fixed value directly.
indx3 <- is.na(df_parcc$pro)
na_sum <- 100-rowSums(df_parcc[indx3, 5:9], na.rm = TRUE)
n <- rowSums(is.na(df_parcc[indx3, 5:9]))
na <- round(na_sum/n,1)
for (i in 1:length(indx3)){
df_parcc[indx3, 5:9][i,which(is.na(df_parcc[indx3, 5:9][i,]))] <- na[i]
}
df_parcc[indx3, 'pro'] <- rowSums(df_parcc[indx3,8:9])
sum(is.na(df_parcc$pro))
## [1] 0
What the above script does? First, rows containing NA in pro are filtered out and recorded as indx3, totally 874 observations. Then, for each row, we calculate the difference of 100 percent and sum of existing values - na_sum, and the number of missing value - n. Dividing na_sum by n gives the quotients na. All of these three variables are vectors which length are the same as indx3 - 203. The last step is replacement and we write a for loop here. For each row (df_parcc[indx3, 5:9][i,]), columns containing NA are selected out with the help of function is.na(). Applying function which() enables us to get the corresponding columns indexs. Once we determine the positions of NA, we can assign what we calculated before - na to these NAs.
Now, we have 0 NAs, which proves the validity and feasibility of our methods. The following step is to create a unique id - still use the combination of county name and school name. And we only keep new variable pro in the final dataset
pac <- df_parcc %>%
within( id <- paste(county, sch, sep = '-')) %>%
select(county, sch.num, sch, subject, pro, id)
This is how the final parcc dataset looks like:
What’s more, the last two rows - The Seed School of Maryland, lacks information about which county it belongs to.
tail(pac)
## county sch.num sch
## 869 Baltimore City 454 Carver Vocational-Technical High
## 870 Baltimore City 480 Baltimore City College
## 871 Baltimore City 480 Baltimore City College
## 872 Baltimore City 480 Baltimore City College
## 873 SEED 1000 The Seed School of Maryland
## 874 SEED 1000 The Seed School of Maryland
## subject pro
## 869 Algebra 2 1.0
## 870 English/Language Arts Grade 10 45.1
## 871 Algebra 1 27.7
## 872 Algebra 2 16.7
## 873 English/Language Arts Grade 10 38.4
## 874 Algebra 1 9.8
## id
## 869 Baltimore City-Carver Vocational-Technical High
## 870 Baltimore City-Baltimore City College
## 871 Baltimore City-Baltimore City College
## 872 Baltimore City-Baltimore City College
## 873 SEED-The Seed School of Maryland
## 874 SEED-The Seed School of Maryland
After check online, we find it is not a traditional high school but a boarding school that draws students from all over the state of Maryland. So it wouldn’t be right to assign them to any particular county. We choose to exclude them from our analysis.
pac <- pac %>%
filter( county != 'SEED')
The proportion of high performance in these assessment (\(p\)) plays an role of response variable in our later analysis. Thus, splicting them out first makes further analysis convenient.
\[ p_{ela} = \beta_0+\beta_{1} x_1+ \beta_{2} x_2 \]
df_ela <- pac[grep("English/Language Arts Grade 10", pac$subject), ]
df_alg1 <- pac[grep("Algebra 1", pac$subject), ]
df_alg2 <- pac[grep("Algebra 2", pac$subject), ]
dim(df_ela) #size of df_ela
## [1] 233 6
dim(df_alg1) #size of df_alg1
## [1] 472 6
dim(df_alg2) #size of df_alg2
## [1] 167 6
It’s wired! The size of df_alg1 is larger than the other two so let’s look at it:
head(df_alg1)
## county sch.num sch subject pro
## 2 Allegany 405 Fort Hill High Algebra 1 9.8
## 4 Allegany 406 Washington Middle Algebra 1 95.2
## 5 Allegany 504 Braddock Middle Algebra 1 84.4
## 6 Allegany 601 Center for Career & Technical Education Algebra 1 7.6
## 9 Allegany 606 Allegany High Algebra 1 8.4
## 11 Allegany 802 Westmar Middle Algebra 1 87.9
## id
## 2 Allegany-Fort Hill High
## 4 Allegany-Washington Middle
## 5 Allegany-Braddock Middle
## 6 Allegany-Center for Career & Technical Education
## 9 Allegany-Allegany High
## 11 Allegany-Westmar Middle
It is because it contains middle school information! Actually for Algebra 1, students will take the exam the year they learn it so that’s why both middle school and high school have this test. Besides, pro of middle schools are much higher than that of high schools. It is because students who take Algebra1 test in middle schoos are more likely to be good at this subject. This fact makes them incompariable with students who take Algebra 1 test in high schools. pro for high school can only reflect part of the academic achievements - those who may not be good at Algebra 1. Pay attention to this point in later analysis.
How to extract the high school information? Here variable id comes to work! df_grad only includes information for high schools. So if obseravations in df_alg1 also exist in df_grad, it corresponds to high school, otherwise to middle school. The unique variable id enables us to check the co-existness with the help of function filter().
df_alg1 <- df_alg1 %>%
filter(id %in% df_grad$id)
dim(df_alg1)
## [1] 226 6
Special_Services_2017.csv records the number and percentage of students who applied different special service and students approved through direct certification. Such variables make us know more about the school quality and students support. Here we choose service FARMS: receive free or reduced price meals.
What the following script does is similiar to previous wrangling steps except one step. Since in special service file there is a vairable called School.Type, we can filter high school information easily if we add a new condition School.Type == 'High'.
df_spc <- data[[3]] %>%
filter( School.Number != 'A', School.Type == 'High' ) %>%
within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
select(LEA.Name, School.Number, School.Name,
FARMS.Pct, id)
colnames(df_spc) <- c('county', 'sch.num', 'sch', 'farms', 'id')
summary(df_spc)
## county sch.num
## Baltimore City :41 0301 : 4
## Prince George's :37 0102 : 3
## Baltimore County:33 0207 : 3
## Montgomery :31 0405 : 3
## Anne Arundel :18 0705 : 3
## Howard :14 0108 : 2
## (Other) :94 (Other):250
## sch farms
## Chesapeake High : 2 * : 12
## Frederick Douglass High : 2 <= 5.0 : 7
## Northwestern High : 2 35.1 : 3
## Aberdeen High : 1 55.4 : 3
## Academy for College and Career Exploration: 1 12.5 : 2
## Academy of Health Sciences at PGCC : 1 13.8 : 2
## (Other) :259 (Other):239
## id
## Length:268
## Class :character
## Mode :character
##
##
##
##
There is only one problem in this dataset, also it is an old problem - missing value.
Similiarly, let’s have a look on what these schools with ’*’ in farms look like:
df_spc[df_spc$farms == '*',]
## county sch.num sch farms
## 6 Anne Arundel 1274 Marley Glen School *
## 10 Anne Arundel 2233 Anne Arundel Evening High *
## 28 Baltimore County 0077 BCDC Educational Center *
## 57 Calvert 0206 Calvert Country School *
## 70 Carroll 0712 Carroll Springs School *
## 72 Carroll 0717 Post Secondary Program *
## 123 Howard 0522 Cedar Lane Special Center *
## 156 Montgomery 0799 Stephen Knolls School *
## 159 Montgomery 0951 Longview School *
## 196 Prince George's 2217 Incarcerated Youth Center (JACS) *
## 220 Wicomico 0520 Wicomico County Evening High *
## 267 Baltimore City 0884 Eager Street Academy *
## id
## 6 Anne Arundel-Marley Glen School
## 10 Anne Arundel-Anne Arundel Evening High
## 28 Baltimore County-BCDC Educational Center
## 57 Calvert-Calvert Country School
## 70 Carroll-Carroll Springs School
## 72 Carroll-Post Secondary Program
## 123 Howard-Cedar Lane Special Center
## 156 Montgomery-Stephen Knolls School
## 159 Montgomery-Longview School
## 196 Prince George's-Incarcerated Youth Center (JACS)
## 220 Wicomico-Wicomico County Evening High
## 267 Baltimore City-Eager Street Academy
It seems that they are still not ‘normal’ public schools so we change ’*’ to NA.
After replacing ‘<= 5.0’ to ‘2.5’ with function gsub(), if you apply as.numeric() function, ’*’ will be transformed to NA automatically.
df_spc$farms <- gsub('<= 5.0','2.5', df_spc$farms)
df_spc$farms <- as.numeric(df_spc$farms)
This is how the final special serivce dataset looks like:
Student_Mobility_2017.csv includes information of the movement of students from one school to another during the school year. There are 3 types of mobility provided: total student mobility, entry mobility, and exit mobility. To some extent, exit mobility reflects students’ satisfaction with their schools so we would like to use it as variable withdraw in our analysis.
As usual, wrangling steps are high school observation selection, unique id creation, feature selection and missing value transformation.
df_mob <- data[[4]] %>%
filter( School.Number != 'A', School.Type == 'High' )%>%
within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
select(LEA.Name, School.Number,
School.Name, Withdrawals.Pct, id)
colnames(df_mob) <- c('county', 'sch.num', 'sch', 'withdraw', 'id')
summary(df_mob)
## county sch.num
## Baltimore City :41 301 : 4
## Prince George's :37 102 : 3
## Baltimore County:33 207 : 3
## Montgomery :31 405 : 3
## Anne Arundel :18 705 : 3
## Howard :14 108 : 2
## (Other) :94 (Other):250
## sch withdraw
## Chesapeake High : 2 <= 5.0 : 77
## Frederick Douglass High : 2 >= 95.0: 9
## Northwestern High : 2 5.5 : 8
## Aberdeen High : 1 8.6 : 5
## Academy for College and Career Exploration: 1 10.3 : 3
## Academy of Health Sciences at PGCC : 1 10.9 : 3
## (Other) :259 (Other):163
## id
## Length:268
## Class :character
## Mode :character
##
##
##
##
df_mob$withdraw <- gsub('<= 5.0','2.5', df_mob$withdraw)
df_mob$withdraw <- gsub('>= 95.0','97.5', df_mob$withdraw)
df_mob$withdraw <- as.numeric(df_mob$withdraw)
This is how the final mobility dataset looks like:
Wealth_Expenditures_Data_2017.csv contains information of different kinds of investments as well as wealth per pupil. What makes it special is that its size (11 variables and 25 observations) is obsivously smaller than previous files. Let’s have a look on it:
It turns out that this file only contains county level data! The last row is state level ‘All Public School’ which needs to will be deleted when we merge datasets. Then, we would like to keep variable county which will work as the key in later merge steps. Besides, a new variable exp will be created by the formula:
\[exp = \frac{Wealth.Per.Pupil}{Expenditures.Per.Pupil}\]
The quotient of Wealth.Per.Pupil and Expenditures.Per.Pupil captures the image of each county’s financial condition. Function mutate() in package dplyr is a nice tool to adds new variables. Thus, wrangling steps are new variable creation, feature selection and rename.
df_exp <- data[[5]] %>%
mutate(exp = round(Wealth.Per.Pupil/Expenditures.Per.Pupil, 1)) %>%
select(LEA.Name, exp)
colnames(df_exp) <- c('county','exp')
We notice the last row is information about ‘All Public School’, but it will be dropped automatically when we merge it.
Now we have totally 8 datasets, requiring combination to answer the questions that we are interested in. They are called relational data because it is their relations, not just the individual datasets, that are important. Relations are defined between a pair in these datasets. In out case, relations presented in the form that the same variable exists in multiple datasets. The following plot describes how our datasets connect:
We can see that variable county exists in all datasets, variable sch, sch.name and id exists in all datasets except df_exp. In addition, each dataset contain its unique variables.
To work with relational data, mutating joins is pretty powerful which combines variables from two datasets. It first matches observations by their keys, then copies across variables from one dataset to the other. keys is a variable that uniquely indentifies an observation so it can connect each pair of datasets. This is the meaning of the existence of variable id - sufficiently indentifies each observation in our datasets.
mutating joins includes inner joins and outer joins. An inner join keeps observations that appear in both datasets while outer join keeps observations that appear in at least one of the datasets. There are three types of outer joins: left join, right join and full join. The graphical explanation is given below:
Inner join:
Outer join:
To be more specific, a table is provided to show detailed descriptions.
| Types | How it works | Implementation in dplyr |
|---|---|---|
| inner join | keeps observations that appear in both datasets | inner_join() |
| left join | keeps all observations in x but drops bbservations in y but not in x |
left_join() |
| right join | keeps all observations in y but drops bbservations in x but not in y |
right_join() |
| full join | keeps all observations in x and y |
full_join() |
Left join is the most common and popular method, especially when we only need part of variables from another dataset. It preserves the original observations even when there isn’t a match - just make it as NA. This is also our best choice and we will show how it works in our case.
The final dataset should include predict variables grad.rate, withdraw, farms, exp and target variable pro. Three datasets with this structure will be created for df_ela, df_alg1 and df_alg2 respectively. As mentioned, we would focus on left join with the help of function left_join(). Let’s see an example for df_ela first:
temp <- df_ela %>%
select(id, county, pro) %>%
left_join(df_grad[, c('grad.rate', 'id')], by = 'id')
head(temp)
## id county pro grad.rate
## 1 Allegany-Fort Hill High Allegany 40.9 82.86
## 2 Allegany-Allegany High Allegany 52.9 90.78
## 3 Allegany-Mountain Ridge High School Allegany 54.2 87.82
## 4 Anne Arundel-Glen Burnie High Anne Arundel 37.9 90.30
## 5 Anne Arundel-North County High Anne Arundel 49.5 86.18
## 6 Anne Arundel-Severna Park High Anne Arundel 83.1 97.50
tail(temp)
## id
## 228 Baltimore City-Augusta Fells Savage Institute of Visual Arts
## 229 Baltimore City-Coppin Academy
## 230 Baltimore City-Renaissance Academy
## 231 Baltimore City-Frederick Douglass High
## 232 Baltimore City-Carver Vocational-Technical High
## 233 Baltimore City-Baltimore City College
## county pro grad.rate
## 228 Baltimore City 2.4 54.20
## 229 Baltimore City 9.3 91.57
## 230 Baltimore City 3.2 52.00
## 231 Baltimore City 3.2 70.10
## 232 Baltimore City 2.8 76.17
## 233 Baltimore City 45.1 97.50
Look! grad.rate in df_grad was added to df_ela successfully. Option by= enables you to specify the keys you would like to use. Then, we repeat this merge step for df_mob, df_spc and df_exp.
| Dataset | Variable Kept | Keys |
|---|---|---|
| df_grad | grad.rate: 4-year graduation rate |
id |
| df_mob | withdraw: students’ exit mobility |
id |
| df_spc | farms: percentage of students reciving free/reduced price meals |
id |
| df_exp | exp: quotient of wealth and expenditures |
county |
temp1 <- df_ela %>%
select(id, county, pro) %>%
left_join(df_grad[, c('grad.rate', 'id')], by = 'id') %>%
left_join( df_mob[, c('withdraw', 'id')], by = 'id') %>%
left_join( df_spc[, c('farms', 'id')], by = 'id') %>%
left_join(df_exp, by = 'county')
Instead of repeating by ourselves, the R base package provides a function Reduce(), which can come in handy. Reduce() takes a function \(f\) of two arguments and a list or vector \(x\) which is to be reduced using \(f\). The function is first called on the first two components of \(x\), then with the result of that as the first argument and the third component of \(x\) as the second argument, then again with the result of the second step as first argument and the fourth component of \(x\) as the second argument etc. The process is continued until all elements of \(x\) have been processed.
L_ela <- list(df_ela[, c('id', 'county', 'pro')],
df_grad[, c('id','grad.rate')],
df_mob[, c('id','withdraw')],
df_spc[, c('id','farms')], df_exp)
temp2 <- Reduce(function(x,y) left_join(x, y,
by = colnames(y)[1]), L_ela)
# `by = colnames(y)[1])` tells how to find the *keys* in each left_join.
The above script shows how Reduce() function works in out case. First, these five files are save in list L_ela. Then function left_join() was applied one by one in list L_ela with function Reduce(). temp2 is the same dataset as temp1. So, let’s write it as a function and apply it to df_alg1 and df_alg2.
func_merge <- function(data){
L <- list(data[, c('id', 'county', 'pro')],
df_grad[, c('id','grad.rate')],
df_mob[, c('id','withdraw')],
df_spc[, c('id','farms')], df_exp)
final_data<- Reduce(function(x,y) left_join(x, y,
by = colnames(y)[1]), L)
return(final_data)
}
df_ELA <- func_merge(df_ela)
df_ALG1 <- func_merge(df_alg1)
df_ALG2 <- func_merge(df_alg2)
Wonderful! Finally we get the ideal datasets: df_ELA, df_ALG1 and df_ALG2.
# Boxplot
p1 <- ggplot(df_ALG1, aes(x = reorder(county, pro),
y = pro,fill = county)) + geom_boxplot()+
guides(fill = FALSE) + theme_bw() +
theme(plot.title = element_text(hjust = 0.5 ),
axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5)) +
labs( x = 'County', y = 'Distribution',
caption = 'proportion of high performance for Algebra 1 in year 2017')
ALG1_sum <- df_ALG1 %>%
group_by(county) %>%
summarise(mean = mean(pro),
sd = sd(pro))
# Line plot
p2 <- ggplot(ALG1_sum, aes( reorder(factor(county), mean), mean, group = 1)) +
geom_line(alpha = 1/3) +
geom_pointrange(aes(ymax = mean + sd, ymin = mean - sd)) +
labs(x = 'County', caption = 'Error bars') +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5 ),
axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))
ggarrange(p1, p2, ncol = 2)
distplot <- function(data, variable){
p <- ggplot(data, aes(x = reorder(county, variable),
y = variable, fill = county)) + geom_boxplot()+
guides(fill = FALSE) + theme_bw() +
theme(plot.title = element_text(hjust = 0.5 ),
axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))+
xlab("County") + ylab("Distribution")
return(p)
}
p3 <- distplot(df_ALG1, df_ALG1$pro)
p4 <- distplot(df_ALG2, df_ALG2$pro)
p5 <- distplot(df_ELA, df_ELA$pro)
ggarrange(p3, p4, p5, nrow = 2, ncol = 2)
p6 <- ggplot(data = df_ALG1, aes(x = grad.rate, y = pro))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
theme(legend.position = "none")
p7 <- ggplot(data = df_ALG1, aes(x = grad.rate, y = pro,group = county))+
facet_grid( ~ county, switch = 'x')+
geom_point(aes(colour = county))+
geom_smooth(method = "lm", se = FALSE, aes(colour = county))+
theme(strip.text=element_text(angle = 90))+
theme(legend.position = "none") +
theme(axis.text = element_blank()) +
theme(axis.ticks = element_blank())
ggarrange(p6, p7, nrow = 2)
#xyplot(pro~grad.rate |county, df_ALG1,
# type = c("p", "smooth"))
To evaluate influence of each predictor variable, linear regression may first comes to your mind. Let’s fit a simple linear regression model here:
m <- lm(pro ~ grad.rate + withdraw + farms + exp, data = df_ELA)
summary(m)
##
## Call:
## lm(formula = pro ~ grad.rate + withdraw + farms + exp, data = df_ELA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.674 -7.764 -0.321 6.993 55.423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.26712 12.14055 3.729 0.000245 ***
## grad.rate 0.19937 0.11298 1.765 0.079011 .
## withdraw -0.13216 0.11418 -1.157 0.248336
## farms -0.82627 0.05685 -14.533 < 2e-16 ***
## exp 0.50565 0.08758 5.774 2.62e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.62 on 220 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7731, Adjusted R-squared: 0.7689
## F-statistic: 187.4 on 4 and 220 DF, p-value: < 2.2e-16
Nevertheless, what makes our dataset different is its nested data structure: units are clustered into subgroups that are also nested within larger groups. We use a plot to indicate our dataset’s structure:
As shown above, the dataset can be devided into two levels: county level (L1) and school level (L2). Such structure is common in human and biological science. For instance, in educational settings, students are clustered into schools and schools are clustered nto districts, which are part of a larger system (county, state or country). The subgroup-specific characteristics may have some potential influence on outcomes. Another typical example can be longtitude data, a variable’s values over time are correlated with each other. In this case, linear regression fail to capture the influence of clustering so we need a more nuanced analysis.
Multilevel models (MLM) are statistical models of parameters that vary at more than one level. To be specific, a two-level model allows for variation in both school level and county level, which cannot be accounted for in traditional ordinary least squares regression. County level variation, also called ‘county effects’, represents a source of dependence that influence schools’ performance. MLM always includes fixed and random parameters. Fixed parameters are composed of a constant over all the groups, whereas a random parameter has a different value for each of the groups.
This section will go over how to run MLM for our dataset in R, and focus on estimating fixed and random parameters as well as how to interpret the output. This section will build fairly simple and slightly more complex models, as the primary focus of this section is on proper model building and interpretation rather than an exhaustive review of highly complex MLM models. This section will also include a short review of different library options for HLM analyses.
First, we will begin with an unconditional model, which will provide a general basis for more sophisticated models to follow. This unconditional model asks the question, ‘Is there variability in Algebra 1 achievement across county?’
m1 <- lmer (pro ~ 1 + 1|county, data=df_ALG1)
Function lmer() in package lme4 is a powerful tool to fit MLM. Package lme4 will allow you to analyze data using restricted maximum likelihood estimation (REML) rather than ordinary least squares (OLS), making it ideal for MLM analyses.